Abstract

In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In 
each round it chooses from a time-invariant set of alternatives and receives the payoff associated with 
this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work 
has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to 
assume extra structure in order to make the problem tractable. In particular, recent literature considered 
information on similarity between arms. 

We consider similarity information in the setting of contextual bandits, a natural extension of the 
basic MAB problem where before each round an algorithm is given the context - a hint about the payoffs 
in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of 
the crucial problems in sponsored search. A particularly simple way to represent similarity information 
in the contextual bandit setting is via a similarity distance between the context-arm pairs which gives an 
upper bound on the difference between the respective expected payoffs. 

Prior work on contextual bandits with similarity uses "uniform" partitions of the similarity space, so 
that, essentially, each context-arm pair is approximated by the closest pair in the partition. Algorithms 
based on "uniform" partitions disregard the structure of the payoffs and the context arrivals. This is 
potentially wasteful, as it may be advantageous to maintain a finer partition in high-payoff regions of 
the similarity space and popular regions of the context space. In this paper, we design algorithms that 
are based on adaptive partitions which are adjusted to the popular and high-payoff regions, and thus can 
take advantage of the problem instances with "benign" payoffs or context arrivals. 

ACM Categories and subject descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: 
Nonnumerical Algorithms and Problems; F.1.2 [Computation by Abstract Devices]: Modes of Computa- 
tion — Online computation 

General Terms: Algorithms, Theory.

1 Introduction



In a multi-armed bandit problem (henceforth, "multi-armed bandit" will be abbreviated as MAB), an algorithm
is presented with a sequence of trials. In each round, the algorithm chooses one alternative from a set
of alternatives (arms) based on the past history, and receives the payoff associated with this alternative. The
goal is to maximize the total payoff of the chosen arms. The MAB setting was introduced in 1952 by
Robbins [31] and has been studied intensively since then in Operations Research, Economics and Computer Science,
e.g. see [9, 16, 11, 8]. This setting is used to model the trade-off between exploration and exploitation, a
crucial issue in sequential decision-making under uncertainty.

One standard way to evaluate the performance of a bandit algorithm is regret, defined as the difference
between the expected payoff of an optimal arm and that of the algorithm. By now the MAB problem with a
small finite set of arms is quite well understood (e.g. see [26, 4, 3]). However, if the arms set is exponentially
or infinitely large, the problem becomes intractable unless we make further assumptions about the problem
instance. Essentially, a bandit algorithm needs to find a needle in a haystack; for each algorithm there are
inputs on which it performs as badly as random guessing.

The bandit problems with large arm sets have been studied in a number of prior works, e.g. [24].
The common theme in these works is to assume a certain structure on payoff functions. Assumptions of
this type are natural in many applications, and often lead to efficient learning algorithms [21]. In particular,
the line of work [2, 20, 12, 5, 30, 25, 29, 24, 10] assumes that some information on similarity between arms
is available to the algorithm.

In this paper we consider similarity information in the setting of contextual bandits [32, 30, 27], a
natural extension of the basic MAB problem where before each round an algorithm is given the context -
a hint about the payoffs in this round. Contextual bandits are directly motivated by the problem of placing
advertisements on webpages, one of the crucial problems in sponsored search. One can cast it as a bandit
problem so that arms correspond to the possible ads, and payoffs correspond to the user clicks. Then the
context consists of information about the page, and perhaps the user this page is served to. Furthermore,
we assume that similarity information is available on both the contexts and the arms. Following the work
in [12, 20, 5, 24] on the (non-contextual) bandits, a particularly simple way to represent similarity information
in the contextual bandit setting is via a similarity distance between the context-arm pairs, which gives an upper
bound on the difference between the corresponding payoffs.

Our model: contextual bandits with similarity information. For simplicity, let us develop the model
for the stochastic (non-adversarial) case only. Let $X$ be the context set and $Y$ be the arms set. In each
round $t$, the following events happen in succession: first, the current context arrival $x_t \in X$ is chosen by an
adversary and revealed to the algorithm, then the algorithm chooses an arm $y_t \in Y$, and finally the payoff
$\pi_t(x_t, y_t) \in [0,1]$ is realized. Here $\pi_t : X \times Y \to [0,1]$ is the payoff function, defined as an independent
random sample from some fixed distribution $\Pi$ over functions $X \times Y \to [0,1]$. Both the contexts $(x_t)_{t \in \mathbb{N}}$
and the distribution $\Pi$ are fixed by an adversary before the first round and are not revealed to the algorithm.

The performance of a contextual bandit algorithm is benchmarked in terms of the context-specific best
arm. Denoting $\mu(x, y) = \mathbb{E}[\pi_t(x, y)]$, the contextual regret is defined as

$$R(T) = \sum_{t=1}^{T} \big( \mu^*(x_t) - \mu(x_t, y_t) \big), \quad\text{where}\quad \mu^*(x) = \sup_{y \in Y} \mu(x, y).$$

The similarity information is given to an algorithm as two metric spaces $(X, \mathcal{D}_X)$ and $(Y, \mathcal{D}_Y)$ called,
respectively, the context space and the arms space,¹ such that the following Lipschitz condition holds:

$$|\mu(x, y) - \mu(x', y')| \le \mathcal{D}_X(x, x') + \mathcal{D}_Y(y, y'). \tag{1}$$

It will be useful to consider the product metric space $(X \times Y, \mathcal{D}_X + \mathcal{D}_Y)$, which we call the similarity
space. Without loss of generality, the distances in $(X, \mathcal{D}_X)$ and $(Y, \mathcal{D}_Y)$ are truncated at 1; to guarantee the
existence of a context-specific best arm, we will assume that both metric spaces are compact.

1 The absence of similarity information on arms is modeled as $\mathcal{D}_Y \equiv 1$.

We term this problem the contextual MAB problem. Note that if the context $x_t$ is time-invariant, the
contextual MAB problem reduces to the stochastic MAB problem [3], or (when the metric similarity
information is present) to the Lipschitz MAB problem [24]; the contextual regret reduces to the "standard"
(context-free) regret. In general, the context-specific best arm is a more demanding benchmark than the best
arm used in the "standard" (context-free) definition of regret.

Uniform vs adaptive partitions. Hazan and Megiddo [18] suggest (for a somewhat different setting) an
algorithm that chooses a "uniform" partition $S_X$ of the context space (namely, an $r$-net) and approximates
$x_t$ by the closest point in $S_X$, call it $x'_t$. Specifically, the algorithm creates an instance $A(x)$ of some bandit
algorithm $A$ for each point $x \in S_X$, and invokes $A(x'_t)$ in each round $t$. The granularity of the partition is
adjusted to the time horizon, the context space, and the black-box regret guarantee for $A$.

Furthermore, Kleinberg [20] provides a bandit algorithm $A$ for the adversarial MAB problem on a
metric space that has a similar flavor: pick a "uniform" partition $S_Y$ of the arms space, and run a $k$-arm
bandit algorithm such as EXP3 [4] on the points in $S_Y$. Again, the granularity of the partition is adjusted to
the time horizon, the arms space, and the black-box regret guarantee for EXP3.

For our setting, putting these two ideas together results in an algorithm with contextual regret

$$R(T) \le O\big(T^{1 - 1/(2 + d_X + d_Y)}\big)(\log T), \tag{2}$$

where $d_X$ is the covering dimension of the context space and $d_Y$ is that of the arms space. (This holds even
for adversarial payoffs.) We will call it the uniform algorithm.
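
The exponent in (2) can be recovered from a standard balancing heuristic. The sketch below suppresses constants and logarithmic factors and assumes a black-box $k$-arm regret bound of order $\sqrt{kT}$; it is an illustration under these assumptions, not the argument given in [18] or [20].

```latex
% Discretize contexts and arms with r-nets: about r^{-d_X} context points in S_X
% and k = r^{-d_Y} arms per bandit instance. T_x = number of rounds routed to net point x.
\begin{align*}
\text{discretization error} &\lesssim r\,T, \\
\text{learning regret} &\lesssim \sum_{x \in S_X} \sqrt{k\,T_x}
   \;\le\; \sqrt{|S_X|\,k\,T} \;\approx\; \sqrt{r^{-(d_X+d_Y)}\,T}
   \quad\text{(Cauchy--Schwarz, } \textstyle\sum_x T_x = T\text{)}, \\
r\,T = \sqrt{r^{-(d_X+d_Y)}\,T}
   &\iff r = T^{-1/(2+d_X+d_Y)}
   \;\Longrightarrow\; R(T) \lesssim T^{\,1-1/(2+d_X+d_Y)}.
\end{align*}
```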

Using "uniform" partitions disregards the structure of the payoffs and the context arrivals. This is po- 
tentially wasteful, as it may be advantageous to maintain a finer partition in higher-paying regions of the 
similarity space, and more popular regions of the context space. The central topic in this paper is designing 
algorithms that are based on adaptive partitions which are adjusted to the popular/high-payoff regions,
and thus can take advantage of the problem instances in which the expected payoffs or the context arrivals
are, in some sense, "benign" ("low-dimensional"). A similar task has been accomplished in [24] for the
context-free setting; however, the present contextual setting is considerably more challenging.

Main contribution: the contextual zooming algorithm. We consider the contextual MAB problem, and
present an algorithm, called the contextual zooming algorithm, which "zooms in" on both the "popular"
regions of the context space and the high-payoff regions of the arms space. The algorithm considers the
context space and the arms space jointly - it maintains a partition of the similarity space, rather than one
partition for contexts and another for arms - which allows it to use the similarity information more efficiently.
We develop provable guarantees that capture the "benign-ness" of the context arrivals and the expected
payoffs. In the worst case, we match the guarantee (2) for the uniform algorithm. For the context-free setting,
our guarantees match those in [24]. We obtain nearly matching lower bounds using the KL-divergence
techniques from [4, 20, 24]. Our algorithm and analysis extend to a more general setting where some
context-arm pairs may be infeasible, and moreover the right-hand side of (1) is replaced by an arbitrary
metric on the feasible context-arm pairs (which is revealed to the algorithm).

We apply the contextual zooming algorithm to a (context-free) adversarial MAB problem in which an
adversary is constrained to change the expected payoffs of each arm gradually, e.g. by a small amount in
each round. This setting is naturally modeled as a contextual MAB problem in which the $t$-th context arrival
is simply $x_t = t$. Then $\mu(t, y)$ corresponds to the expected payoff of arm $y$ at time $t$, and the context metric
$\mathcal{D}_X(t, t')$ provides an upper bound on the temporal change $|\mu(t, y) - \mu(t', y)|$. We term it the drifting MAB
problem. Interestingly, this problem incorporates significant constraints both across time (for each arm) and
across arms (for each time); to the best of our knowledge, such MAB models are quite rare in the literature.²
Notable special cases of $\mathcal{D}_X$ include $\mathcal{D}_X(t, t') = \sigma |t - t'|$ and $\mathcal{D}_X(t, t') = \sigma \sqrt{|t - t'|}$, which correspond to,
respectively, the bounded change per round and the high-probability behavior of a random walk. We derive
provable guarantees for these two examples, and show that they are essentially optimal.
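
As a small illustration of the two temporal metrics, the following sketch implements both, truncated at 1 in line with the convention that distances are truncated at 1, and checks the triangle inequality on a finite set of rounds. The function names and the sample values are assumptions made for this example only.

```python
import math


def linear_drift(t: int, s: int, sigma: float) -> float:
    """D_X(t, s) = sigma * |t - s|, truncated at 1 (bounded change per round)."""
    return min(1.0, sigma * abs(t - s))


def sqrt_drift(t: int, s: int, sigma: float) -> float:
    """D_X(t, s) = sigma * sqrt(|t - s|), truncated at 1 (random-walk scale)."""
    return min(1.0, sigma * math.sqrt(abs(t - s)))


def is_metric_on(times, d, sigma, tol=1e-12) -> bool:
    """Check the triangle inequality for d on a finite set of rounds."""
    return all(d(a, c, sigma) <= d(a, b, sigma) + d(b, c, sigma) + tol
               for a in times for b in times for c in times)


print(linear_drift(3, 7, 0.1), sqrt_drift(0, 4, 0.25))
print(is_metric_on(range(20), linear_drift, 0.1), is_metric_on(range(20), sqrt_drift, 0.1))
```

Under the square-root metric two rounds stay "similar" for much longer than under the linear one, which is why the two cases lead to different covering numbers of the context space.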

Interestingly, the contextual MAB problem subsumes the stochastic sleeping bandits problem [23],
where in each round some arms are "asleep", i.e. not available in this round. Each context arrival $x_t$
corresponds to the set of arms that are "awake" in this round. More precisely, contexts $x$ correspond to
subsets $S_x$ of arms, so that only the context-arm pairs $(x, y)$, $y \in S_x$ are feasible, and the context distance is
$\mathcal{D}_X \equiv 0$. Moreover, the contextual MAB problem extends the sleeping bandits setting by incorporating
similarity information on arms. The contextual zooming algorithm (and its analysis) applies, and is geared to
exploit this additional similarity information. In the absence of such information the algorithm essentially
reduces to the "highest awake index" algorithm in [23].

Other contributions. We turn our attention to the contextual MAB problem with adversarial payoffs (see
Section 2 for the setup). We describe an algorithm that is geared to take advantage of "benign" sequences
of context arrivals. It is in fact a meta-algorithm: given an adversarial bandit algorithm Bandit, and a
regret bound for that algorithm, we present a contextual bandit algorithm ContextualBandit which
calls Bandit as a subroutine.³ The crucial feature of our algorithm is that it maintains a partition of the
context space that is adapted to the context arrivals. Again, we develop provable guarantees that capture
the "benign-ness" of the context arrivals. In the worst case, the contextual regret is bounded in terms of the
covering dimension of the context space (and the regret bound for Bandit). In particular, we match the
guarantee (2) for the uniform algorithm.

One inherent drawback of MAB problems on metric spaces is that they require the entire metric space to 
be revealed to the algorithm, whereas in reality only coarser similarity information may be available. To this 
end, we consider a version of the stochastic MAB problem where the similarity information is given by a 
tree-shaped taxonomy on arms, without an explicitly specified similarity distance.⁴ However, the taxonomy
implicitly defines a similarity distance as follows: the distance between two arms is equal to the maximal
variation in expected payoffs in their least common subtree. We provide an algorithm for this setting, 
called TaxonomyBandit. Under appropriate assumptions about the "quality" of the taxonomy, we obtain 
guarantees similar to those for the setting where the induced similarity metric is explicitly revealed to the 
algorithm. The novel technical issue here is that apart from the familiar exploration-exploitation trade-off 
regarding payoffs there is now a similar exploration-exploitation trade-off regarding the induced similarity 
distance. The two trade-offs need to be handled jointly, which leads to significant technical complications. 
Obviously, TaxonomyBandit can be used as a subroutine for the contextual meta-algorithm from the 
previous paragraph. We conjecture that the same technique can be used to extend the contextual zooming 
algorithm to a setting where both the context space and the arms space are given via taxonomies of suitable 
"quality", with essentially matching guarantees. 

2 The only other MAB model with this flavor that we are aware of, found in Hazan and Kale [17], combines linear payoffs and
bounded "total variation" of the cost functions. The latter is, essentially, an aggregate bound on the temporal change.

3 Apart from using the corresponding "naive" algorithm, our presentation allows us to leverage prior work on other adversarial
MAB formulations, such as the basic $k$-arm version [4], linear payoffs [28, 6, 14, 17] and convex payoffs [20, 15].

4 This setting was introduced in [30], an experimental paper with a web advertising motivation, and later studied in [25, 29].
The guarantees in this paper are incomparable with the ones in [25, 29]: they are much stronger, but in a more tractable model.

Organization of the paper. The contextual zooming algorithm is presented in Section 3 (see Section 3.2
for the application to the drifting MAB problem). Lower bounds are in Section 4. The meta-algorithm
for adversarial payoffs is presented in Section 5. The taxonomy-based similarity information is treated in
Section 6.

2 Preliminaries 

We will use the notation from the Introduction without further notice. In particular, $x_t$ will denote the $t$-th
context arrival, i.e. the context that arrives in round $t$, and $y_t$ will denote the arm chosen by the algorithm in
that round. We will use $x_{(1..T)}$ to denote the sequence of the first $T$ context arrivals $(x_1, \ldots, x_T)$.

Metric spaces. Fix a metric space. A ball with center $x$ and radius $r$ is denoted $B(x, r)$. Formally,
we will treat a ball as a (center, radius) pair rather than a set of points. Let $S$ be a set of points. The
$r$-covering number of $S$ is the smallest number of sets of diameter $r$ that suffice to cover $S$. The covering
dimension of $S$ (with constant $c$) is the smallest $d$ such that the $r$-covering number is at most $c\,r^{-d}$ for each
$r > 0$. The doubling constant of $S$ is the smallest $c_{\mathrm{DBL}}$ such that for any $\delta \in (0, 1)$, any subset $S' \subset S$ of
diameter $r$ can be covered by $c_{\mathrm{DBL}}^{-\log \delta}$ sets of diameter $\delta r$. The covering dimension (with constant 1) is at
most $\log c_{\mathrm{DBL}}$, but can be much smaller in general. The following fact is well-known: if the distance between
any two points in $S$ is $> r$, then any ball of radius $r$ contains at most $c_{\mathrm{DBL}}^2$ points of $S$.
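
To make the $r$-covering number concrete, here is a minimal sketch that greedily covers a finite point set by sets of diameter at most $r$. It returns an upper bound on the covering number rather than the exact minimum; the function name `greedy_cover_size` and the integer toy example are assumptions made for illustration.

```python
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")


def greedy_cover_size(points: Sequence[P], dist: Callable[[P, P], float], r: float) -> int:
    """Return the size of a greedily built cover by sets of diameter <= r.

    This is an upper bound on the r-covering number, not necessarily the minimum.
    """
    uncovered = list(points)
    num_sets = 0
    while uncovered:
        center = uncovered[0]
        # Points within r/2 of the center are pairwise within r (triangle inequality).
        uncovered = [p for p in uncovered if dist(center, p) > r / 2]
        num_sets += 1
    return num_sets


# Toy example (assumption): 64 integer points on a line with the usual distance.
line = list(range(64))
abs_dist = lambda a, b: abs(a - b)
print(greedy_cover_size(line, abs_dist, 16))  # sets of up to 9 consecutive points
print(greedy_cover_size(line, abs_dist, 8))   # sets of up to 5 consecutive points
```

Halving $r$ roughly doubles the count on a line, which matches a covering dimension of 1 for this toy space.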

A function $f : X \to \mathbb{R}$ is a Lipschitz function on a metric space $(X, \mathcal{D})$ if the following Lipschitz
condition holds: $|f(x) - f(x')| \le \mathcal{D}(x, x')$ for each $x, x' \in X$.

Adversarial payoffs. Let us set up the contextual MAB problem with adversarial payoffs and support
$\mathcal{F}$. The main distinction is that now in each round $t$, the payoff function $\pi_t$ is sampled from a time-specific
distribution $\Pi_t$ over the set $\mathcal{F}$ of all possible payoff functions. Fixing the time horizon $T$ and denoting
$\mu_t(x, y) = \mathbb{E}[\pi_t(x, y)]$, the context-specific best arm is defined as

$$y^*(x) \in \arg\max_{y \in Y} \sum_{t=1}^{T} \mu_t(x, y), \tag{3}$$
where the ties are broken in an arbitrary but fixed way.⁵ The contextual regret is defined as

$$R(T) = \sum_{t=1}^{T} \big( \mu^*_t(x_t) - \mu_t(x_t, y_t) \big), \quad\text{where}\quad \mu^*_t(x) = \mu_t(x, y^*(x)).$$

Analogously to (1), for each round $t$ the expected payoff function $\mu_t$ is a Lipschitz function on the similarity
space. Moreover, the benchmark payoff $\mu^*_t$ is a Lipschitz function on the context space.

Accessing the metric spaces. We assume full and computationally unrestricted access to the similarity 
information. While the issues of efficient representation thereof are important in practice, we believe that a 
proper treatment of these issues would be specific to the particular application and the particular similarity 
metric used, and would obscure the present paper. One clean formal way to model this issue is to assume 
oracle access to the metric spaces in question, e.g. via an oracle that inputs a collection of balls and a set, 
and outputs an arbitrary point in the set which is not covered by these balls, if there is any. 

Fixed vs arbitrary time horizon. There is a well-known doubling trick which converts a bandit algo- 
rithm ALG with a fixed time horizon to one that runs indefinitely and achieves essentially the same regret 
bound for every given round. Namely, in each phase $i = 1, 2, 3, \ldots$, run a fresh instance of ALG for $2^i$
rounds. With this trick in mind, we will assume fixed time horizon throughout this paper. 
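
The doubling trick can be sketched in a few lines. The code below assumes a hypothetical factory `make_alg(horizon)` that returns a fresh fixed-horizon instance, represented here as a zero-argument callable that plays one round and returns its payoff; these names are illustrative and not part of the paper.

```python
from typing import Callable, Iterator, Tuple


def doubling_phases(total_rounds: int) -> Iterator[Tuple[int, int]]:
    """Yield (phase index i, number of rounds played in phase i).

    Phase i is allotted 2**i rounds; the last phase is cut off at total_rounds.
    """
    i, used = 1, 0
    while used < total_rounds:
        length = min(2 ** i, total_rounds - used)
        yield i, length
        used += length
        i += 1


def run_indefinitely(make_alg: Callable[[int], Callable[[], float]], total_rounds: int) -> float:
    """Restart a fresh fixed-horizon instance in every phase; return the total payoff."""
    total = 0.0
    for i, length in doubling_phases(total_rounds):
        play_round = make_alg(2 ** i)   # fresh instance tuned to horizon 2**i
        for _ in range(length):
            total += play_round()
    return total


# Toy stand-in for ALG (assumption): every round yields payoff 1/horizon.
def make_toy_alg(horizon: int) -> Callable[[], float]:
    return lambda: 1.0 / horizon


print(list(doubling_phases(10)))
print(run_indefinitely(make_toy_alg, 6))
```

Since the phase lengths grow geometrically, only about $\log_2 T$ restarts happen within $T$ rounds, which is why the regret bound is preserved up to constant factors for bounds that are polynomial in the horizon.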

5 The max in (3) is attained by some $y \in Y$ as a maximum of a Lipschitz function on a compact metric space.

3 The contextual zooming algorithm



In this section we consider the (stochastic) contextual MAB problem. We present an algorithm for this 
problem, called the contextual zooming algorithm, which takes advantage of both the "benign" context 
arrivals and the "benign" expected payoffs by "zooming in" on both the "popular" regions on the context 
space and the high-payoff regions of the arms space. The algorithm considers the context space and the 
arms space jointly: it adaptively maintains a partition of the similarity space. 

We will consider a model that generalizes the one from the Introduction in the following two directions.
First, not all context-arm pairs $(x, y)$ will be allowed: there will be a set $\mathcal{P} \subset X \times Y$ of feasible
context-arm pairs. Using the notation $\mu(x, y) = \mu_t(x, y)$ for the expected payoff function, the benchmark payoff
is $\mu^*(x) = \sup_{y:\,(x,y) \in \mathcal{P}} \mu(x, y)$. Second, instead of the context space and the arms space there will be a
metric space $(\mathcal{P}, \mathcal{D})$, revealed to the algorithm, such that the following Lipschitz condition holds:

$$|\mu(x, y) - \mu(x', y')| \le \mathcal{D}((x, y), (x', y')). \tag{4}$$

This $(\mathcal{P}, \mathcal{D})$ is called the similarity space. This will be the only metric space that appears in this section.

Our algorithm maintains a covering of the similarity space with balls. We formulate the provable guar- 
antees in terms of (essentially) the maximal number of balls in such a covering, under various restrictions 
that are implicitly enforced by the algorithm. 

Definition 3.1. Consider an instance of the contextual MAB problem. The $r$-zooming number $N(r)$ is
defined as the $r$-covering number of the set $\mathcal{P}_r = \{(x, y) \in \mathcal{P} : \mu^*(x) - \mu(x, y) \le 12\,r\}$. Given a constant
$c > 0$, the contextual zooming dimension is the smallest $d \ge 0$ such that $N(r) \le c\,r^{-d}$ for all $r > 0$.

It is easy to see that the contextual zooming dimension is bounded from above by the covering dimension,
but can be much smaller for "benign" expected payoffs. To take into account "benign" context arrivals,
a more complicated setup is needed (without any modification to the algorithm), see Section 3.1 for details.
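
To illustrate how the zooming number can be much smaller than the covering number of the whole space, the sketch below estimates both on a toy finite instance. It uses a greedy cover (an upper bound, not the exact covering number); the grid, the payoff function `mu`, and all function names are assumptions chosen for this example.

```python
from itertools import product


def greedy_cover_size(points, dist, r):
    """Greedy upper bound on the r-covering number (sets of diameter <= r)."""
    uncovered, num_sets = list(points), 0
    while uncovered:
        center = uncovered[0]
        uncovered = [p for p in uncovered if dist(center, p) > r / 2]
        num_sets += 1
    return num_sets


def zooming_number_upper_bound(pairs, mu, dist, r):
    """Greedy cover of the near-optimal set {(x, y): mu*(x) - mu(x, y) <= 12 r}."""
    best = {}
    for x, y in pairs:                       # mu*(x) = max over feasible arms
        best[x] = max(best.get(x, float("-inf")), mu(x, y))
    near_optimal = [(x, y) for x, y in pairs if best[x] - mu(x, y) <= 12 * r]
    return greedy_cover_size(near_optimal, dist, r)


# Toy instance (assumption): contexts and arms on a 33-point grid in [0, 1].
grid = [i / 32 for i in range(33)]
pairs = list(product(grid, grid))
mu = lambda x, y: 1.0 - abs(x - y)                                 # best arm is y = x
dist = lambda p, q: min(1.0, abs(p[0] - q[0]) + abs(p[1] - q[1]))  # D_X + D_Y, truncated

r = 1 / 64
print(zooming_number_upper_bound(pairs, mu, dist, r), greedy_cover_size(pairs, dist, r))
```

In this instance only the band of pairs with $|x - y| \le 12r$ is near-optimal, so at small $r$ the cover of the near-optimal set is noticeably smaller than the cover of the full grid.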

Theorem 3.2. The contextual regret of the contextual zooming algorithm is

$$R(T) \le \inf_{r_0 > 0} O\Big(r_0 T + \sum_{r = 2^{-i}:\; r_0 \le r \le 1} \tfrac{1}{r}\, N(r) \log(T)\Big) \le O\big(T^{1 - 1/(2+d)} \log T\big), \tag{5}$$
where $N(r)$ is the $r$-zooming number and $d$ is the contextual zooming dimension with any given constant $c$.

Note that the bounds in Theorem 3.2 hold for each $c > 0$. This is useful since different values of $c$ may
result in drastically different values of the contextual zooming dimension.

The contextual zooming algorithm. The algorithm is parameterized by the time horizon T. In each round 
t, it maintains a finite collection of balls (called active balls) which collectively cover the similarity space. 
Adding active balls is called activating; balls stay active once they are activated. Initially there is only one 
active ball which has radius 1 and contains the entire similarity space. 

In each round $t$, each active ball $B$ is assigned a number $I_t(B)$ called index, defined as follows. Let
$r(B)$ be the radius of $B$, let $n_t(B)$ be the number of times it has been selected so far by the algorithm, and let
$\nu_t(B)$ be the corresponding average payoff. (Define the latter to be $0$ if the ball has not been selected yet.)
The pre-index of $B$ is defined as

$$I^{\mathrm{pre}}(B) \triangleq \nu_t(B) + 2\,r(B) + \mathrm{rad}_t(B), \quad\text{where}\quad \mathrm{rad}_t(B) \triangleq 4 \sqrt{\frac{\log T}{1 + n_t(B)}}. \tag{6}$$
The quantity $\mathrm{rad}_t(B)$ is called the confidence radius of $B$ at time $t$. Finally, the index of $B$ is

$$I_t(B) \triangleq \min \big( I^{\mathrm{pre}}(B') + \mathcal{D}(B, B') \big), \tag{7}$$



where the minimum is over all balls $B'$ of radius $r(B') \ge r(B)$ that have been activated before time $t$, and
the distance $\mathcal{D}(B, B')$ between two balls $B$ and $B'$ is defined as the distance between their centers.

The algorithm operates as follows. In each round $t$, the domain of an active ball $B$, denoted $\mathrm{dom}_t(B)$,
is a subset of $B$ defined as $B$ minus the union of all balls of strictly smaller radius that are active at time
$t$. Ball $B$ is called relevant in round $t$ if it is active and $(x_t, y) \in \mathrm{dom}_t(B)$ for some arm $y \in Y$. The
algorithm picks an arm to play according to the following selection rule: it selects a relevant ball $B$ with
the maximal index (breaking ties arbitrarily) and an arbitrary arm $y$ such that $(x_t, y) \in \mathrm{dom}_t(B)$. Balls are
activated according to the following activation rule: if an active ball $B$ is selected in a round $t$ such that
$\mathrm{rad}_t(B) \le r(B)$, then a ball $B'$ with center $(x_t, y_t)$ and radius $\tfrac{1}{2} r(B)$ is activated; $B$ is then called the
parent of $B'$. This completes the specification.
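
The specification above can be sketched on a toy finite instance as follows. This is a minimal illustration under several assumptions that are not taken from the paper: contexts and arms are numbers in $[0,1]$, the similarity distance is the sum of coordinate distances truncated at 1, the confidence radius is taken to be $4\sqrt{\log T / (1 + n)}$, the activation test uses the selection count from before the current round, and payoffs are Bernoulli. All names are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Ball:
    center: tuple        # (context, arm) pair
    radius: float
    n: int = 0           # number of times the ball was selected
    total: float = 0.0   # sum of payoffs observed when selected


def dist(p, q):
    """Toy similarity distance (assumption): D_X + D_Y on [0, 1]^2, truncated at 1."""
    return min(1.0, abs(p[0] - q[0]) + abs(p[1] - q[1]))


def contextual_zooming(contexts, arms, payoff, rng):
    """Run the sketch on a context sequence; return (active balls, total payoff)."""
    log_t = math.log(len(contexts))
    balls = [Ball(center=(contexts[0], arms[0]), radius=1.0)]  # radius 1 covers everything

    def rad(b):           # confidence radius, assumed form 4 * sqrt(log T / (1 + n))
        return 4.0 * math.sqrt(log_t / (1 + b.n))

    def pre_index(b):     # average payoff (0 if never selected) + 2 r(B) + rad(B)
        return (b.total / b.n if b.n else 0.0) + 2 * b.radius + rad(b)

    def index(b):         # minimum over active balls of radius >= r(B)
        return min(pre_index(o) + dist(b.center, o.center)
                   for o in balls if o.radius >= b.radius)

    def in_domain(b, p):  # p lies in B but in no active ball of strictly smaller radius
        return dist(b.center, p) <= b.radius and not any(
            o.radius < b.radius and dist(o.center, p) <= o.radius for o in balls)

    reward = 0.0
    for x in contexts:
        best, best_arms, best_index = None, [], -math.inf
        for b in balls:   # selection rule: relevant ball with the maximal index
            feasible = [y for y in arms if in_domain(b, (x, y))]
            if feasible and index(b) > best_index:
                best, best_arms, best_index = b, feasible, index(b)
        y = rng.choice(best_arms)              # arbitrary arm in the domain
        is_full = rad(best) <= best.radius     # tested before this round's update
        gain = payoff(x, y, rng)
        best.n += 1
        best.total += gain
        reward += gain
        if is_full:                            # activation rule
            balls.append(Ball(center=(x, y), radius=best.radius / 2))
    return balls, reward


def bernoulli_payoff(x, y, rng):
    """Toy payoffs (assumption): Bernoulli with mean 1 - |x - y|, 1-Lipschitz in dist."""
    return 1.0 if rng.random() < 1.0 - abs(x - y) else 0.0


rng = random.Random(0)
arms = [i / 10 for i in range(11)]
contexts = [rng.choice([0.2, 0.8]) for _ in range(1000)]
balls, reward = contextual_zooming(contexts, arms, bernoulli_payoff, rng)
print(len(balls), reward / len(contexts))
```

The domain test is what keeps new centers outside all smaller active balls, so balls of equal radius end up with well-separated centers; the sketch recomputes domains and indices from scratch each round for clarity rather than speed.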

Remark. In the algorithm, the "domains" and the activation rule are defined in a specific way to ensure
that each active ball is centered in a point $(x_t, y_t)$ such that in round $t$ its parent ball is selected (rather
than in some arbitrary point inside the parent ball). This is useful for the general setting (4) considered
in this section, whereas with some other reasonable versions of the activation rule we need an additional
assumption that (essentially) the benchmark payoff $\mu^*$ satisfies a Lipschitz property.⁶

Claim 3.3. The following invariants are maintained: 

• (covering) in each round, active balls cover the similarity space, 

• (separation) for any two active balls of radius r, their centers are at distance at least r. 

Proof. The activation rule makes sure that the following property (*) holds: if a ball $B(p, r)$ is activated
in round $t$, then in any subsequent round $t'$ it is covered by active balls of radius $\le r$. (To show this, use
induction on $t'$.) The covering invariant holds initially, and so by (*) it holds in every subsequent round.

To show the separation invariant, let $B$ and $B'$ be two balls of radius $r$ such that $B$ is activated at
time $t$, with parent $B^{\mathrm{par}}$, and $B'$ is activated before time $t$. The parent ball $B^{\mathrm{par}}$ has radius $2r$, so by property
(*) its round-$t$ domain is disjoint with $B'$. By the activation rule, the center of $B$ is $(x_t, y_t)$, and by the
selection rule $(x_t, y_t) \in \mathrm{dom}_t(B^{\mathrm{par}})$. It follows that $(x_t, y_t) \notin B'$. □

Analysis. Throughout the analysis we will use the following notation. For a ball $B$ with center $(x, y)$,
denote the radius as $r(B)$, and define the expected payoff of $B$ as $\mu(B) = \mu(x, y)$. The badness of $(x, y)$ is
defined as $\Delta(x, y) = \mu^*(x) - \mu(x, y)$. Let $B^{\mathrm{sel}}_t$ be the active ball selected by the algorithm in round $t$.

Claim 3.4. If ball $B$ is active in round $t$, then with probability at least $1 - T^{-3}$ we have that

$$|\nu_t(B) - \mu(B)| \le r(B) + \mathrm{rad}_t(B). \tag{8}$$

Proof. Fix ball $B$ with center $(x, y)$. Let $S$ be the set of rounds $s < t$ when ball $B$ was selected by the
algorithm, and let $n = |S|$ be the number of such rounds. Then $\nu_t(B) = \frac{1}{n} \sum_{s \in S} \pi_s(x_s, y_s)$.

Define $Z_k = \sum \big( \pi_s(x_s, y_s) - \mu(x_s, y_s) \big)$, where the sum is taken over the $k$ smallest elements $s \in S$.
Then $\{Z_k\}$ is a martingale with bounded increments, so by the Azuma-Hoeffding inequality with probability
at least $1 - T^{-3}$ we have that $\frac{1}{n} |Z_n| \le \mathrm{rad}_t(B)$. Note that $|\mu(x_s, y_s) - \mu(B)| \le r(B)$ for each $s \in S$, so
$|\nu_t(B) - \mu(B)| \le r(B) + \frac{1}{n} |Z_n|$, which completes the proof. □
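
The probability bound in this proof can be checked by a short calculation. The sketch below treats $n \ge 1$ as fixed, uses the standard Azuma-Hoeffding bound for martingale increments in $[-1, 1]$, and assumes the confidence radius has the form $4\sqrt{\log T / (1 + n)}$; the additional care needed because $n$ is random is omitted here.

```latex
\begin{align*}
\Pr\!\left[\tfrac{1}{n}\,|Z_n| > \mathrm{rad}_t(B)\right]
  &\le 2\exp\!\left(-\frac{\big(n\,\mathrm{rad}_t(B)\big)^2}{2n}\right)
   = 2\exp\!\left(-\frac{8\,n\,\log T}{1+n}\right) \\
  &\le 2\exp(-4\log T) = 2\,T^{-4} \le T^{-3}
  \qquad (n \ge 1,\; T \ge 2).
\end{align*}
```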

Call a run of the algorithm clean if (8) holds for each round. From now on we will focus on a clean run,
and argue deterministically using (8). The heart of the analysis is the following lemma.

Lemma 3.5. Consider a clean run of the algorithm. Then $\Delta(x_t, y_t) \le 15\,r(B^{\mathrm{sel}}_t)$ in each round $t$.

6 This assumption is satisfied for a narrower setting (1) defined in the Introduction.

Proof. Fix round $t$. By the covering invariant, $(x_t, y^*(x_t)) \in B$ for some active ball $B$. Recall from (7)
that $I_t(B) = I^{\mathrm{pre}}(B') + \mathcal{D}(B, B')$ for some active ball $B'$ of radius $r(B') \ge r(B)$. Therefore

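$$\begin{aligned}
I_t(B^{\mathrm{sel}}_t) \ge I_t(B) &= I^{\mathrm{pre}}(B') + \mathcal{D}(B, B') && \text{(selection rule, defn of index (7))} \\
&= \nu_t(B') + 2\,r(B') + \mathrm{rad}_t(B') + \mathcal{D}(B, B') && \text{(defn of preindex (6))} \\
&\ge \mu(B') + r(B) + \mathcal{D}(B, B') && \text{(Claim 3.4 and } r(B') \ge r(B)\text{)} \\
&\ge \mu(B) + r(B) \ge \mu(x_t, y^*(x_t)) = \mu^*(x_t). && \text{(Lipschitz property (4), twice)}
\end{aligned} \tag{9}$$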

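On the other hand, letting $B^{\mathrm{par}}$ be the parent of $B^{\mathrm{sel}}_t$ and noting that by the activation rule

$$\mathrm{rad}_t(B^{\mathrm{par}}) \le r(B^{\mathrm{par}}) = 2\,r(B^{\mathrm{sel}}_t), \tag{10}$$

we can upper-bound $I_t(B^{\mathrm{sel}}_t)$ as follows:

$$\begin{aligned}
I_t(B^{\mathrm{sel}}_t) &\le I^{\mathrm{pre}}(B^{\mathrm{par}}) + r(B^{\mathrm{par}}) && \text{(defn of index (7))} \\
&= \nu_t(B^{\mathrm{par}}) + 3\,r(B^{\mathrm{par}}) + \mathrm{rad}_t(B^{\mathrm{par}}) && \text{(defn of preindex (6))} \\
&\le \mu(B^{\mathrm{par}}) + 4\,r(B^{\mathrm{par}}) + 2\,\mathrm{rad}_t(B^{\mathrm{par}}) && \text{(Claim 3.4)} \\
&\le \mu(B^{\mathrm{par}}) + 12\,r(B^{\mathrm{sel}}_t) && \text{("parenthood" (10))} \\
&\le \mu(x_t, y_t) + 15\,r(B^{\mathrm{sel}}_t) && \text{(Lipschitz property (4))}.
\end{aligned} \tag{11}$$

In the last inequality we used the fact that $(x_t, y_t)$ is within distance $3\,r(B^{\mathrm{sel}}_t)$ from the center of $B^{\mathrm{par}}$.
Putting the pieces together, $\mu^*(x_t) \le I_t(B^{\mathrm{sel}}_t) \le \mu(x_t, y_t) + 15\,r(B^{\mathrm{sel}}_t)$. □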

Corollary 3.6. In a clean run, if ball $B$ is activated in round $t$ then $\Delta(x_t, y_t) \le 12\,r(B)$.

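Proof. By the activation rule, $B^{\mathrm{sel}}_t$ is the parent of $B$. Thus by Lemma 3.5 we immediately have
$\Delta(x_t, y_t) \le 15\,r(B^{\mathrm{sel}}_t) = 30\,r(B)$. To obtain the constant of 12 that is claimed here, it suffices to prove a more efficient
special case of Lemma 3.5: if $\mathrm{rad}_t(B^{\mathrm{sel}}_t) \le r(B^{\mathrm{sel}}_t)$ then $\Delta(x_t, y_t) \le 6\,r(B^{\mathrm{sel}}_t)$. To prove this, we simply
replace (11) in the proof of Lemma 3.5 by a similar inequality in terms of $I^{\mathrm{pre}}(B^{\mathrm{sel}}_t)$ rather than $I^{\mathrm{pre}}(B^{\mathrm{par}})$:

$$\begin{aligned}
I_t(B^{\mathrm{sel}}_t) \le I^{\mathrm{pre}}(B^{\mathrm{sel}}_t) &= \nu_t(B^{\mathrm{sel}}_t) + 2\,r(B^{\mathrm{sel}}_t) + \mathrm{rad}_t(B^{\mathrm{sel}}_t) && \text{(defns (6), (7))} \\
&\le \mu(B^{\mathrm{sel}}_t) + 3\,r(B^{\mathrm{sel}}_t) + 2\,\mathrm{rad}_t(B^{\mathrm{sel}}_t) && \text{(Claim 3.4)} \\
&\le \mu(x_t, y_t) + 6\,r(B^{\mathrm{sel}}_t). && \square
\end{aligned}$$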

Now we are ready for the final regret computation. 

Proof of Theorem 3.2. For a given $r = 2^{-i}$, $i \in \mathbb{N}$, let $\mathcal{F}_r$ be the collection of all balls of radius $r$ that
have been activated throughout the execution of the algorithm. A ball $B \in \mathcal{F}_r$ is called full in round $t$ if
$\mathrm{rad}_t(B) \le r$. Note that in each round, if a full ball is selected then some other ball is activated. Thus,
we will partition the rounds among active balls as follows: for each ball $B \in \mathcal{F}_r$, let $S_B$ be the set of
rounds which consists of the round when $B$ was activated and all rounds $t$ when $B$ was selected and not
full. It is easy to see that $|S_B| \le O(r^{-2} \log T)$. Moreover, by Lemma 3.5 and Corollary 3.6 we have
$\Delta(x_t, y_t) \le 15\,r$ in each round $t \in S_B$.

If ball $B \in \mathcal{F}_r$ is activated in round $t$, then by the activation rule its center is $(x_t, y_t)$, and Corollary 3.6
asserts that $(x_t, y_t) \in \mathcal{P}_r$, as defined in Definition 3.1. By the separation invariant, the centers of balls in
$\mathcal{F}_r$ are at distance at least $r$ from one another. It follows that these centers can be covered by $N(r)$ sets of
diameter $r$, where $N(r)$ is the $r$-zooming number of the problem instance, and so $|\mathcal{F}_r| \le N(r)$.

Fixing some $r_0 > 0$, note that in each round $t$ when a ball of radius $< r_0$ was selected, regret is
$\Delta(x_t, y_t) \le O(r_0)$, so the total regret from all such rounds is at most $O(r_0 T)$. Therefore, contextual regret
can be written as follows:

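$$\begin{aligned}
R(T) = \sum_{t=1}^{T} \Delta(x_t, y_t) &= O(r_0 T) + \sum_{r = 2^{-i}:\; r_0 \le r \le 1}\; \sum_{B \in \mathcal{F}_r}\; \sum_{t \in S_B} \Delta(x_t, y_t) \\
&\le O(r_0 T) + \sum_{r} \sum_{B \in \mathcal{F}_r} |S_B|\, O(r) \\
&\le O\Big(r_0 T + \sum_{r = 2^{-i}:\; r_0 \le r \le 1} \tfrac{1}{r}\, N(r) \log(T)\Big),
\end{aligned}$$

which proves the first inequality in (5). The second one is obtained by setting $r_0 = T^{-1/(d+2)}$. □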
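
For completeness, the second inequality in (5) can be worked out as follows. This is a sketch assuming $N(r) \le c\,r^{-d}$ with $d \ge 0$, so that the sum over $r = 2^{-i}$ is geometric with ratio at least 2 and is dominated by its term with the smallest radius.

```latex
\begin{align*}
\sum_{r=2^{-i}:\; r_0 \le r \le 1} \tfrac{1}{r}\,N(r)\,\log T
  &\le c\,\log T \sum_{r=2^{-i}:\; r_0 \le r \le 1} r^{-(d+1)}
   \le 2c\, r_0^{-(d+1)} \log T, \\
r_0 = T^{-1/(d+2)} \;\Longrightarrow\;
r_0\,T &= T^{1-1/(d+2)}, \qquad
r_0^{-(d+1)} = T^{(d+1)/(d+2)} = T^{1-1/(d+2)}.
\end{align*}
```

Both terms are then of order $T^{1-1/(d+2)}$, up to the $\log T$ factor and the constant $c$.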
3.1 Improved guarantees for the contextual zooming algorithm 

Let us develop provable guarantees for the contextual zooming algorithm that take into account both the
expected payoffs and the context arrivals $x_{(1..T)}$. Instead of the covering numbers used in Theorem 3.2 and
Theorem 5.1 we will use the dual packing numbers. An $r$-packing of a set $S$ in a metric space is a subset
$S' \subset S$ such that any two points in $S'$ are at distance $> r$ apart. Note that the size of any $r$-packing of $S$ is
bounded from above by the $r$-covering number of $S$. We will consider the $r$-packings of the set

$$\mathcal{P}_r = \{(x, y) \in \mathcal{P} : \Delta(x, y) \le 12\,r\},$$

the same set as in Definition 3.1. To take into account $x_{(1..T)}$, we will count only the $r$-packings of $\mathcal{P}_r$ that are
"plausible" given $x_{(1..T)}$.

Definition 3.7. A context $x$ is relevant to a ball $B$ if there exists an arm $y$ such that $(x, y) \in B$. A set $S \subset \mathcal{P}$
is plausible w.r.t. $x_{(1..T)}$ if each point $p \in S$ can be mapped to a subset $\phi_p \subset x_{(1..T)}$ of at least $r^{-2}$ context
arrivals that are relevant to $B(p, r)$ so that the sets $\{\phi_p : p \in S\}$ are disjoint. The $(r, \mu)$-packing number is
the largest size of an $r$-packing of $\mathcal{P}_r$ that is plausible w.r.t. $x_{(1..T)}$.

The idea is that if $S$ is an $r$-packing of $\mathcal{P}_r$ that is not plausible then we can rule out an execution of
the contextual zooming algorithm such that each ball $B(p, r)$, $p \in S$ is activated and then deactivated. Now, it
is easy to modify the analysis in Section 3 so that instead of counting the balls of radius $r$ that are activated
by the algorithm, we count the balls of radius $r$ that are activated and deactivated. This suffices because
each active ball is a child of some ball that has been deactivated, and we bound the number of children
using the doubling constant of the similarity space. We obtain the following theorem.

Theorem 3.8. Consider the contextual MAB problem with time-invariant payoffs. Then, letting $N(r)$ be the 
$(r, \mu)$-packing number of the problem instance, the contextual zooming algorithm achieves regret 

\[ R(T) \le O(c_{\mathrm{DBL}}) \, \inf_{r_0 > 0} \Big( r_0 T + \sum_{r = 2^{-i}:\; r_0 \le r \le 1} \tfrac{1}{r}\, N(r) \log(T) \Big), \qquad (12) \]
where $c_{\mathrm{DBL}}$ is the doubling constant of the similarity space. 

It is easy to see that the $(r, \mu)$-packing number from Theorem 3.8 is bounded from above by the 
$r$-zooming number from Theorem 3.2, so the former theorem provides stronger guarantees. Just like in 
Theorem 3.2, one can define a version of the contextual zooming dimension based on the $(r, \mu)$-packing numbers, 
call it $d$, and show that $R(T) \le O(c_{\mathrm{DBL}}\, T^{1-1/(2+d)} \log T)$. 

3.2 Application to the drifting MAB problem 

Let us consider the drifting MAB problem. Recall that this is simply the contextual MAB problem in which 
the $t$-th context arrival is $x_t = t$, the quantity $\mu(t, y)$ is the expected payoff of arm $y$ at time $t$, and the 
context space $(X, \mathcal{D}_X)$ represents the bound on temporal change. Contextual regret becomes dynamic regret 
- regret with respect to a benchmark which in each round plays the best arm for this round. Typically, the 
covering numbers of the context space are proportional to the time horizon $T$, so the quantity of interest is 
the average dynamic regret $\hat{R}(T) = R(T)/T$. 

The general provable guarantees are given in Theorem 3.2 in terms of the $r$-zooming numbers, and the 
construction in Theorem 4.1 provides the matching lower bound, up to polylog$(T)$ factors (we omit the 
easy details). For some specific numeric examples, suppose $\mathcal{D}_X(t, t') = \sigma \sqrt{|t - t'|}$ and consider the 
worst-case dynamic regret for this context space. Then the $r$-covering number of the context space is $O(\sigma/r)^2$. 
If the arms space consists of $k$ arms with no similarity information, then from Theorem 3.2 it is easy to see that 
$\hat{R}(T) \le O(k\, \sigma^2 \log T)^{1/4}$ as long as $T \ge T_0$, where $T_0 = \Omega(\sigma\sqrt{k})^{-1}$.$^7$ More generally, if $d_Y$ is the 
covering dimension of the arms space, with constant $c_Y$, then $\hat{R}(T) \le O(c_Y\, \sigma^2 \log T)^{1/(4+d_Y)}$ as long as 
$T \ge \Omega(\sigma\sqrt{c_Y})$. Similar bounds can be obtained, e.g., for $\mathcal{D}_X(t, t') = \sigma |t - t'|$. For all these examples, 
the construction in Theorem 4.1 provides the matching lower bounds up to polylog$(T)$ factors. 

4 Lower bounds 

Let us formulate the lower bounds which essentially match the upper bound in Theorem 3.2. 

Theorem 4.1. Consider the contextual MAB problem with time-invariant payoffs. Fix the context space and 
the arms space. Let $N_X(r)$ and $N_Y(r)$ be their respective $r$-covering numbers. Then for any $r > 0$ and 
any $N \le N_X(r)\, N_Y(r)$ there exists a family of $\Omega(N)$ problem instances such that any algorithm has regret 
$R(N r^{-2}) \ge \Omega(N/r)$ on at least a half of these instances. If the context space and the arms space have 
doubling constant $\le c_{\mathrm{DBL}}$, then an $\Omega(r)$-zooming number of each problem instance is at most $N$. 

Remark. For time horizon $T = N/r^2$ the lower bound in Theorem 4.1 matches the upper bound in 
Theorem 3.2 up to $O(\log T)$ factors (i) for every problem instance if the context space and the arms space are 
doubling, (ii) for a "worst-case" problem instance whose $r$-zooming number equals $N_X(r)\, N_Y(r)$. Recall 
that the $r$-zooming number of each problem instance in Theorem 4.1 is trivially at most $N_X(r)\, N_Y(r)$, the 
$r$-covering number of the similarity space $(\mathcal{P}, \mathcal{D}) = (X \times Y, \mathcal{D}_X + \mathcal{D}_Y)$. 

Proof Sketch. For simplicity, let $N = n_X n_Y$, where $1 \le n_X \le N_X(r)$ and $2 \le n_Y \le N_Y(r)$. By definition 
of the $r$-covering number, any $r$-net on the context space has size at least $N_X(r)$. Let $S_X$ be an arbitrary 
set of $n_X$ points from one such $r$-net. Similarly, let $S_Y$ be an arbitrary set of $n_Y$ points from some $r$-net on 
the arms space. The sequence $x_{(1..T)}$ of context arrivals is defined in an arbitrary round-robin fashion over 
the points in $S_X$. For each $x \in S_X$ we construct a standard needle-in-the-haystack lower-bounding example 
on the set $S_Y$. Specifically, pick one point $y^*(x) \in S_Y$ (but the algorithm does not know which), 
and define p,(x, y*{x)) = \ + §, and fi(x, y) = | + | for each y G SV\{y*(x)}. We smoothen the expected 
payoffs so that for the context-arms pairs that are far away from Sx x Sy the expected payoff is |, and the 
Lipschitz condition £T|) holds: 

fx(x, y) = max IoeSxi yoe5Y max ( |, /x(x , yo) - V x (x, x ) - V Y (y, yo) ) . 

It is easy to see that on such problem instance, all points of badness at most | can be covered by N balls 
of radius |, hence the claimed bound on the r'-zooming number for a suitable r' = O(r). Using the 

KL-divergence technique from [4, 22, 24], one can show that by the time $T = N/r^2$ each context in $S_X$ 
contributes $\Omega(\tfrac{1}{r})(n_Y - 1)$ to regret. Essentially, this is because for each $x \in S_X$ the algorithm needs to try 
each arm $y \in S_Y$ at least $\Omega(r^{-2})$ times, incurring regret $\Omega(r)$ for each try of each $y \ne y^*(x)$. □ 

$^7$ Note that one can replace the $\log(T)$ factor in dynamic regret by a $\log(k\sigma)$ factor by restarting the algorithm every $T_0$ rounds. 

5 Adversarial payoffs: the contextual meta-algorithm 

In this section we consider the contextual MAB problem with adversarial payoffs. We describe an algorithm 
that is geared to take advantage of "benign" ("low-dimensional") sequence of context arrivals. It is in fact a 
meta-algorithm: given an adversarial bandit algorithm Bandit, we present a contextual bandit algorithm 
ContextualBandit which calls Bandit as a subroutine. The crucial feature of our algorithm is that it 
maintains a partition of the context space that is adapted to the context arrivals. 

We consider the contextual MAB problem with a known support $(Y, \mathcal{F})$. Our algorithm is parameterized 
by a regret guarantee for Bandit on instances with support $(Y, \mathcal{F})$. For a more concrete theorem statement, 
we will assume that the convergence time$^8$ of Bandit is at most $T_0(r) = c_Y\, r^{-(2+d_Y)} \log(\tfrac{1}{r})$ for some 
constants $c_Y$ and $d_Y$ that are known to the algorithm. In particular, an algorithm that (essentially) can be 
found in [20] achieves this guarantee if $d_Y$ is the $c$-covering dimension of the arms space and $c_Y = O(c^{2+d_Y})$. 

We will consider the context space $(X, \mathcal{D}_X)$; this is the only metric space that appears in the rest of this 
section. The term "ball" will refer to a ball in the context space. Recall that $x_{(1..T)}$ denotes the sequence (or 
a multiset) of the first $T$ context arrivals $(x_1, \ldots, x_T)$. Let $c_{\mathrm{DBL}} = c_{\mathrm{DBL}}(x_{(1..T)}, \mathcal{D}_X)$. 

We quantify the "goodness" of the context sequence in terms of the covering number of $x_{(1..T)}$ in the 
context space, where $T$ is the time horizon. In fact, we will use a more permissive notion that allows a 
limited number of "outliers": the $(r, k)$-covering number of a multiset $S$ is the $r$-covering number of the set 
$\{x \in S : |B(x, r) \cap S| \ge k\}$.$^9$ One can naturally define a version of the covering dimension that uses the 
$(r, k)$-covering numbers rather than the $r$-covering numbers. Specifically, given a constant $c$ and a function 
$k : (0, 1) \to \mathbb{N}$, the relaxed covering dimension of $S$ with slack $k(\cdot)$ is the smallest $d \ge 0$ such that the 
$(r, k(r))$-covering number of $S$ is at most $c\, r^{-d}$ for all $r > 0$. 
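
The $(r, k)$-covering number can be made concrete with a minimal sketch for contexts on the real line with distance $|x - x'|$. The one-dimensional setting, the function names, and the toy multiset are assumptions made here for illustration and do not come from the paper. On the line, a left-to-right greedy cover by sets of diameter at most $r$ is optimal, so the sketch computes the covering number exactly in that special case.

```python
from typing import List, Sequence


def heavy_points(points: Sequence[float], r: float, k: int) -> List[float]:
    """Points x of the multiset whose ball B(x, r) holds at least k points,
    counted with multiplicities."""
    return [x for x in points if sum(abs(x - z) <= r for z in points) >= k]


def covering_number(points: Sequence[float], r: float) -> int:
    """Smallest number of sets of diameter <= r covering the points.

    On the real line a left-to-right greedy sweep is optimal.
    """
    count = 0
    right_end = None
    for x in sorted(points):
        if right_end is None or x > right_end:
            count += 1
            right_end = x + r
    return count


def relaxed_covering_number(points: Sequence[float], r: float, k: int) -> int:
    """(r, k)-covering number: cover only the points that are not outliers."""
    return covering_number(heavy_points(points, r, k), r)


# Toy multiset (hypothetical): a dense cluster near 0 and one outlier at 0.9.
CONTEXTS = [0.0] * 5 + [0.01, 0.02, 0.9]
```

With `r = 0.1`, taking `k = 1` counts the outlier as a separate region, while `k = 3` ignores it, which is the sense in which the relaxed notion is more permissive.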

Theorem 5.1. Consider the contextual MAB problem with adversarial payoffs and support $\mathcal{F}$. Let Bandit be an 
algorithm for the adversarial MAB problem whose convergence time on problem instances with support $\mathcal{F}$ 
is at most $T_0(r) = c_Y\, r^{-(2+d_Y)} \log(\tfrac{1}{r})$ for some constants $c_Y$ and $d_Y$ that are known to the algorithm. Then 
ContextualBandit achieves contextual regret, for any time $T$ and $c_X > 0$, 

\[ R(T) \le O(c_{\mathrm{DBL}}^2)\, (c_X c_Y)^{1/(2+d_X+d_Y)}\; T^{1 - 1/(2+d_X+d_Y)}\, (\log T), \qquad (13) \]

where $c_{\mathrm{DBL}}$ is the doubling constant of $x_{(1..T)}$, and $d_X$ is the relaxed covering dimension of $x_{(1..T)}$ with slack 
$T_0(\cdot)$ and constant $c_X$. 

Remark. The guarantee (13) should be contrasted with the one for the "naive" algorithm: the difference 
is that (13) uses a more "permissive" notion of dimensionality for the context space. For a version of (13) 
that is stated in terms of the "raw" $(r, k_r)$-covering numbers of $x_{(1..T)}$, see (15) in the analysis. 

Remark. For a bounded number of arms, algorithm EXP3 [4] achieves $d_Y = 0$ and $c_Y = O(\sqrt{|Y|})$. For 
linear payoffs - problem instances whose support $\mathcal{F}$ consists of linear functions on $Y$, where $Y$ is identified 
with a convex subset of $\mathbb{R}^d$ - there exist algorithms with $d_Y = 0$ and $c_Y = \mathrm{poly}(d)$ [14, 1]. Likewise, for 
convex payoffs there exist algorithms with $d_Y = 2$ and $c_Y = O(d)$ [15]. 

$^8$ The $r$-convergence time $T_0(r)$ is the smallest $T_0$ such that regret is $R(T) \le rT$ for each $T \ge T_0$. 

$^9$ By abuse of notation, here $|B(x, r) \cap S|$ denotes the number of points $x' \in S$, with multiplicities, that lie in $B(x, r)$. 

The algorithm. The contextual bandit algorithm ContextualBandit is parameterized by an 
adversarial MAB algorithm Bandit, which it uses as a subroutine, and a function $T_0(\cdot) : (0, 1) \to \mathbb{N}$. 

The algorithm maintains a finite collection of balls, called active balls. A ball stays active once it is 
activated. Initially there is one active ball of radius 1 which contains the entire context space. Some active 
balls will be designated full, as defined below. 

In each round $t$, one ball may be activated according to the following activation rule. Suppose all active 
balls that contain $x_t$ are full. Among these balls, pick the one with the smallest radius (break ties arbitrarily), 
call it $B$. Activate the ball $B(x_t, \tfrac{r}{2})$, where $r$ is the radius of $B$, and call it a child of $B$. 

Let us say that a ball $B$ is hit in round $t$ if (after the activation rule is called) it is active, not full, contains 
$x_t$, and among all such balls it has been activated last (any other tie-breaking rule can be used as well). A 
ball of radius $r$ is full if it is active and it has been hit $T_0(r)$ times. 

Once ball $B$ is activated, we initialize a fresh instance $\mathcal{A}_B$ of algorithm Bandit whose "arms" are 
indexed by points in $Y$. In each round, if $B$ is the ball that is hit, algorithm $\mathcal{A}_B$ is called to select an arm 
$y \in Y$ to be played, and the resulting payoff is fed back to $\mathcal{A}_B$. This completes the specification. 
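
The specification above can be sketched in code as follows. This is a minimal illustration, not the paper's implementation: it assumes contexts in $[0, 1]$ with distance $|x - x'|$, a finite arm set, a textbook EXP3 variant as a stand-in for Bandit, a slack function with $T_0(r) \ge 1$, and a toy payoff rule. The class and function names (`Exp3`, `Ball`, `simulate`) are hypothetical.

```python
import math
import random
from dataclasses import dataclass


class Exp3:
    """Stand-in adversarial bandit over a finite arm set, payoffs in [0, 1]."""

    def __init__(self, n_arms, rng, gamma=0.1):
        self.n_arms, self.rng, self.gamma = n_arms, rng, gamma
        self.weights = [1.0] * n_arms
        self.probs = [1.0 / n_arms] * n_arms

    def select(self):
        total = sum(self.weights)
        self.probs = [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                      for w in self.weights]
        return self.rng.choices(range(self.n_arms), weights=self.probs)[0]

    def update(self, arm, payoff):
        estimate = payoff / self.probs[arm]  # importance-weighted payoff
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        top = max(self.weights)  # rescale to avoid overflow; ratios unchanged
        self.weights = [w / top for w in self.weights]


@dataclass
class Ball:
    center: float
    radius: float
    bandit: Exp3
    hits: int = 0


class ContextualBandit:
    """Meta-algorithm sketch; t0 maps a radius to the number of hits until full."""

    def __init__(self, n_arms, t0, seed=0):
        self.n_arms, self.t0 = n_arms, t0
        self.rng = random.Random(seed)
        # One initial ball of radius 1 that contains all of [0, 1].
        self.balls = [Ball(0.5, 1.0, Exp3(n_arms, self.rng))]

    def is_full(self, ball):
        return ball.hits >= self.t0(ball.radius)

    def select(self, x):
        containing = [b for b in self.balls if abs(x - b.center) <= b.radius]
        if all(self.is_full(b) for b in containing):
            # Activation rule: child of the smallest full ball containing x.
            parent = min(containing, key=lambda b: b.radius)
            child = Ball(x, parent.radius / 2, Exp3(self.n_arms, self.rng))
            self.balls.append(child)
            containing.append(child)
        # Hit ball: active, not full, contains x, activated last.
        self.hit = [b for b in containing if not self.is_full(b)][-1]
        self.hit.hits += 1
        self.arm = self.hit.bandit.select()
        return self.arm

    def update(self, payoff):
        self.hit.bandit.update(self.arm, payoff)


def simulate(rounds=500, seed=1):
    rng = random.Random(seed)
    alg = ContextualBandit(2, t0=lambda r: math.ceil(4 / r ** 2), seed=seed)
    for _ in range(rounds):
        x = rng.random()
        arm = alg.select(x)
        alg.update(1.0 if arm == int(x > 0.5) else 0.0)  # toy payoff rule
    return alg
```

Because `self.balls` is kept in activation order, taking the last non-full ball in `containing` implements the "activated last" tie-breaking rule. Each ball owns its own bandit instance, so payoffs observed in one region never influence arm selection in another.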

Claim 5.2. The algorithm satisfies the following basic properties: 

(a) (Correctness) In each round $t$, exactly one active ball is hit. 

(b) Each active ball of radius $r$ is hit at most $T_0(r)$ times. 

(c) (Separation) For any two active balls $B(x, r)$ and $B(x', r)$ we have $\mathcal{D}_X(x, x') > r$. 

(d) Each active ball has at most $c_{\mathrm{DBL}}^2$ children, where $c_{\mathrm{DBL}}$ is the doubling constant of $x_{(1..T)}$. 

Proof. For (a), it suffices to show that after the activation rule is called, there exists an active non-full ball 
containing $x_t$. For the sake of contradiction, suppose such a ball does not exist. Then it was the case even 
before the activation rule has been called. Thus, the activation rule did activate some ball $B$ centered at $x_t$, 
as a child of some other ball $B'$ of twice the radius. The radius of $B'$ must be the smallest among all balls 
that in round $t$ are full and contain $x_t$. It follows that $B$ is not full in round $t$ (and therefore not active). 

For (b), simply note that by the algorithm's specification a ball is hit only when it is not full. 

To prove (c), suppose that $\mathcal{D}_X(x, x') \le r$ and suppose $B(x', r)$ is activated in some round $t$ while 
$B(x, r)$ is active. Then $B(x', r)$ was activated as a child of some ball $B^*$ of radius $2r$. On the other hand, 
$x' = x_t \in B(x, r)$, so $B(x, r)$ must have been full in round $t$ (else no ball would have been activated), and 
consequently the radius of $B^*$ is at most $r$. Contradiction. 

For (d), consider the children of a given active ball $B(x, r)$. Note that by the activation rule the centers 
of these children are points in $x_{(1..T)} \cap B(x, r)$, and by the separation property any two of these points lie 
at distance $> \tfrac{r}{2}$ from one another. By the doubling property, there can be at most $c_{\mathrm{DBL}}^2$ such points. □ 

In the remainder of the section we prove Theorem 5.1. Let us fix the time horizon $T$, and let $R(T)$ 
denote the contextual regret of ContextualBandit. Partition $R(T)$ into the contributions of active balls 
as follows. Let $\mathcal{B}$ be the set of all balls that are active after round $T$. For each $B \in \mathcal{B}$, let $S_B$ be the set of all 
rounds $t$ when $B$ has been hit. Then 

\[ R(T) = \sum_{B \in \mathcal{B}} R_B(T), \quad \text{where} \quad R_B(T) \triangleq \sum_{t \in S_B} \mu^*_t(x_t) - \mu_t(x_t, y_t). \]

Claim 5.3. For each ball $B = B(x, r) \in \mathcal{B}$, we have $R_B \le 3\, r\, T_0(r)$. 

Proof. By the Lipschitz conditions on $\mu_t$ and $\mu^*_t$, for each round $t \in S_B$ it is the case that 

\[ \mu^*_t(x_t) \le r + \mu^*_t(x) = r + \mu_t(x, y^*(x)) \le 2r + \mu_t(x_t, y^*(x)). \]

The $t$-round regret of Bandit is at most $R_0(t) = t\, T_0^{-1}(t)$. Therefore, letting $n = |S_B|$ be the number of 
times algorithm $\mathcal{A}_B$ has been invoked, we have that 

\[ R_0(n) + \sum_{t \in S_B} \mu_t(x_t, y_t) \ge \sum_{t \in S_B} \mu_t(x_t, y^*(x)) \ge \sum_{t \in S_B} \mu^*_t(x_t) - 2rn. \]

Therefore $R_B(T) \le R_0(n) + 2rn$. Recall that by Claim 5.2(b) we have $n \le T_0(r)$. Thus, by definition of 
convergence time $R_0(n) \le R_0(T_0(r)) \le r\, T_0(r)$, and therefore $R_B(T) \le 3\, r\, T_0(r)$. □ 

Let $\mathcal{F}_r$ be the collection of all full balls of radius $r$. Let us bound $|\mathcal{F}_r|$ in terms of the $(r, k)$-covering 
number of $x_{(1..T)}$ in the context space, which we denote $N(r, k)$. 

Claim 5.4. There are at most $N(r, T_0(r))$ full balls of radius $r$. 

Proof. Fix $r$ and let $k = T_0(r)$. Let us say that a point $x \in x_{(1..T)}$ is heavy if $B(x, r)$ contains at least $k$ 
points of $x_{(1..T)}$, counting multiplicities. Clearly, $B(x, r)$ is full only if its center is heavy. By definition 
of the $(r, k)$-covering number, there exists a family $\mathcal{S}$ of $N(r, k)$ sets of diameter $\le r$ that cover all heavy 
points in $x_{(1..T)}$. For each full ball $B = B(x, r)$, let $S_B$ be some set in $\mathcal{S}$ that contains $x$. By Claim 5.2(c), 
the sets $S_B$, $B \in \mathcal{F}_r$, are all distinct. Thus, $|\mathcal{F}_r| \le |\mathcal{S}| \le N(r, k)$. □ 

Proof of Theorem 5.1. Let $\mathcal{B}_r$ be the set of all balls of radius $r$ that are active after round $T$. By the 
algorithm's specification, each ball in $\mathcal{F}_r$ has been hit $T_0(r)$ times, so $|\mathcal{F}_r| \le T/T_0(r)$. Then using Claim 5.2(d) 
and Claim 5.4, we have 

\[ |\mathcal{B}_{r/2}| \le c_{\mathrm{DBL}}^2\, |\mathcal{F}_r| \le c_{\mathrm{DBL}}^2 \min\big( T/T_0(r),\; N(r, T_0(r)) \big), \]
\[ \sum_{B \in \mathcal{B}_{r/2}} R_B \le O(r)\, T_0(r)\, |\mathcal{B}_{r/2}| \le O(c_{\mathrm{DBL}}^2) \min\big( rT,\; r\, T_0(r)\, N(r, T_0(r)) \big). \qquad (14) \]

Trivially, for any full ball of radius $r$ we have $T_0(r) \le T$. Thus, summing (14) over all such $r$, we obtain 

\[ R(T) \le O(c_{\mathrm{DBL}}^2) \sum_{r = 2^{-i}:\; i \in \mathbb{N} \text{ and } T_0(r) \le T} \min\big( rT,\; r\, T_0(r)\, N(r, T_0(r)) \big). \qquad (15) \]

Note that (15) makes no assumptions on $N(r, T_0(r))$. Now, plugging in $T_0(r) = c_Y\, r^{-(2+d_Y)}$ and $N(r, T_0(r)) \le 
c_X\, r^{-d_X}$ into (15) and optimizing it for $r$, it is easy to derive the desired bound (13). □ 
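
For concreteness, the optimization step can be worked out as follows. This is a sketch under simplifying assumptions that are not spelled out in the text: $T \ge c_X c_Y$ (so that the balancing radius is at most 1), and the logarithmic factor in $T_0(r)$ is bounded by $O(\log T)$ over the relevant scales and reinstated at the end. The symbols $d = d_X + d_Y$ and $r^*$ are introduced here for illustration only.

```latex
% Balance the two arguments of the min in (15):
%   r T = r T_0(r) N(r, T_0(r))  with  T_0(r) N(r, T_0(r)) <= c_X c_Y r^{-(2+d)}.
\begin{align*}
r^* &= \left( \frac{c_X c_Y}{T} \right)^{1/(2+d)}, \qquad d = d_X + d_Y, \\
\sum_{r = 2^{-i} < r^*} r\,T &\le 2\, r^* T, \\
\sum_{r = 2^{-i} \ge r^*} r\, T_0(r)\, N(r, T_0(r))
  &\le \sum_{r = 2^{-i} \ge r^*} c_X c_Y\, r^{-(1+d)}
   \le 2\, c_X c_Y\, (r^*)^{-(1+d)} = 2\, r^* T, \\
r^* T &= (c_X c_Y)^{1/(2+d)}\; T^{\,1 - 1/(2+d)} .
\end{align*}
```

Both sums are geometric, so each is dominated by its term closest to $r^*$. Multiplying by the $O(c_{\mathrm{DBL}}^2)$ factor and the $O(\log T)$ factor recovers the form of (13).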

6 Implicit similarity information: the taxonomy MAB problem 

In this section we consider a setting where the similarity information is given by a topical taxonomy on 
arms. A tree-shaped taxonomy on arms implicitly defines a similarity distance as follows: the distance 
between two arms is equal to the maximal variation in payoffs in the least common subtree. We obtain 
guarantees similar to those for the Lipschitz MAB problem [24] in which the induced similarity metric is 
explicitly revealed to the algorithm. On a technical level, the crucial issue is the more complex version of 
the exploration-exploitation trade-off: taking sufficiently many random samples from a given subtree reveals 
not only the corresponding expected payoff, but also information about the implicit distances. 

The setting: taxonomy MAB problem. Consider the (context-free) MAB problem with time-invariant 
random payoffs. Let $X$ be the set of arms, and let $\mu : X \to [0, 1]$ be the expected payoffs: $\mu(x)$ is 
the expected payoff of arm $x$. An algorithm is given a taxonomy $\mathcal{T}$ on $X$, which is simply a rooted tree 
whose leaf set is $X$. For an internal node $v \in \mathcal{T}$ let $\mathcal{T}(v)$ be the subtree rooted at $v$, and let $X(v)$ be the 
corresponding set of leaves (i.e., arms). The weight of $v$ is defined as $\mathrm{wgt}(v) = \sup_{x, y \in X(v)} |\mu(x) - \mu(y)|$. 
The taxonomy is seen as an implicit representation of the following natural similarity distance $(X, \mathcal{D})$: for 
any $x, y \in X$ we define $\mathcal{D}(x, y)$ as the weight of their least common ancestor. Note that $\mu$ is a Lipschitz 
function on $(X, \mathcal{D})$. 

The quality of the taxonomy is expressed as follows. Consider the random walk on a directed version 
of $\mathcal{T}$ oriented away from the root. Let $\mathcal{P}(v, u)$ be the probability that this random walk reaches a node $u$ 
starting from $v$. A sample from this distribution is called a random sample from $\mathcal{T}(v)$. Define the expected 
payoff from the subtree $\mathcal{T}(v)$ as that of the random sample: $\mu(v) = \sum_{x \in X(v)} \mu(x)\, \mathcal{P}(v, x)$. Then the 
quality of the taxonomy is the largest number $q$ such that the following condition holds: 

• for each subtree $\mathcal{T}(v)$ containing an optimal strategy, there exist internal nodes $u, u' \in \mathcal{T}(v)$ such that 
$|\mu(u) - \mu(u')| \ge \mathrm{wgt}(v)/2$, and moreover $\min(\mathcal{P}(v, u), \mathcal{P}(v, u')) \ge q$. 
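
The quantities defined so far can be computed directly on a small tree, as in the following sketch. It assumes that the random walk moves to a uniformly random child at each step, which the text does not specify. The tree encoding (a dict from node to list of children), the function names, and the toy payoffs are likewise hypothetical.

```python
def subtree_leaves(children, v):
    """Leaf set X(v) of the subtree rooted at v."""
    kids = children.get(v, [])
    if not kids:
        return [v]
    return [x for c in kids for x in subtree_leaves(children, c)]


def reach_prob(children, v, u):
    """Probability that a walk from v, moving to a uniform child, reaches u."""
    if v == u:
        return 1.0
    kids = children.get(v, [])
    if not kids:
        return 0.0
    return sum(reach_prob(children, c, u) for c in kids) / len(kids)


def expected_payoff(children, mu, v):
    """mu(v): expected payoff of a random sample from the subtree of v."""
    return sum(mu[x] * reach_prob(children, v, x)
               for x in subtree_leaves(children, v))


def weight(children, mu, v):
    """wgt(v): maximal variation of expected payoffs among leaves below v."""
    values = [mu[x] for x in subtree_leaves(children, v)]
    return max(values) - min(values)


def distance(children, mu, root, x, y):
    """Implicit similarity distance: weight of the least common ancestor."""
    v = root
    while True:
        below = [c for c in children.get(v, [])
                 if {x, y} <= set(subtree_leaves(children, c))]
        if not below:
            return weight(children, mu, v)
        v = below[0]


# Toy taxonomy (hypothetical): two topics with two arms each.
CHILDREN = {"root": ["a", "b"], "a": ["x1", "x2"], "b": ["x3", "x4"]}
MU = {"x1": 0.9, "x2": 0.7, "x3": 0.2, "x4": 0.4}
```

On this toy instance, arms under the same topic are at distance about 0.2, while arms under different topics are at distance about 0.7, so the taxonomy encodes that payoffs vary little within a topic.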

Remark. Here we have a more complex exploration-exploitation trade-off: taking sufficiently many random 
samples from a given subtree $\mathcal{T}(v)$ reveals not only its expected payoff $\mu(v)$ but also its weight $\mathrm{wgt}(v)$. 
Accordingly, both payoffs and weights are subject to the trade-off. 

Provable guarantees. Let us define the zooming dimension of the problem instance (with constant $c$) as 
that of the corresponding instance $(X, \mathcal{D}, \mu)$ of the Lipschitz MAB problem, as defined in [24]. In the 
present setting, it is the smallest number $d$ such that the following covering property holds for each $\delta > 0$: 

• the set $X_\delta = \{x \in X : \mu^* - \mu(x) \le \delta\}$ can be covered with $c\, \delta^{-d}$ subtrees whose weight is $< \delta/8$. 

Recall that it is upper-bounded by the covering dimension, but takes into account the expected payoff 
function. In particular, it is not impacted by a subtree with high covering dimension but low payoffs. 

We provide an algorithm TaxonomyBandit which, for a large enough time horizon, matches the 
guarantee in [24] for the corresponding instance $(X, \mathcal{D}, \mu)$ of the Lipschitz MAB problem up to polylog$(T)$ 
factors. 

Theorem 6.1. Consider the taxonomy MAB problem on a constant-degree taxonomy. Then for any $c > 0$ 
algorithm TaxonomyBandit achieves regret $R(T) \le O(\tfrac{c}{q})^{1/(2+d)}\; T^{1 - 1/(2+d)} \log(T)$ for any $T \ge 2^{1/q}$, 
where $d$ is the zooming dimension with constant $c$, and $q$ is the quality of the taxonomy. 

Algorithm TaxonomyBandit. The algorithm is parameterized by the time horizon $T$ and a quality 
parameter $q$, a pessimistic (lower) bound on the quality of the taxonomy.$^{10}$ In each round the algorithm 
selects one of the tree nodes, say $v$, and plays a randomly sampled strategy $x$ from $\mathcal{T}(v)$. We say that a 
subtree $\mathcal{T}(u)$ is hit in this round if $u \in \mathcal{T}(v)$ and $x \in \mathcal{T}(u)$. For each tree node $v \in V$ and time $t$, let $n_t(v)$ 
be the number of times the subtree $\mathcal{T}(v)$ has been hit by the algorithm before time $t$, and let $\mu_t(v)$ be the 
corresponding average reward. Note that $E[\mu_t(v)] = \mu(v)$ if $n_t(v) > 0$. Define the confidence radius of $v$ 
at time $t$ as 

\[ \mathrm{rad}_t(v) \triangleq \sqrt{ 8 \log(T) / (2 + n_t(v)) }. \qquad (16) \]

The meaning of the confidence radius is that $|\mu_t(v) - \mu(v)| \le \mathrm{rad}_t(v)$ with high probability. 
For each tree node $v$ and time $t$, let us define the index of $v$ at time $t$ as 

\[ I_t(v) \triangleq \mu_t(v) + (1 + 2 k_A)\, \mathrm{rad}_t(v), \quad \text{where} \quad k_A \triangleq 4\sqrt{2/q}. \qquad (17) \]

Here we posit $\mu_t(v) = 0$ if $n_t(v) = 0$. Let us define the weight estimate 

\[ \mathrm{wgt}_t(v) \triangleq \max_{u_1, u_2 \in \mathcal{T}(v)} \max\big( 0,\; |\mu_t(u_1) - \mu_t(u_2)| - \mathrm{rad}_t(u_1) - \mathrm{rad}_t(u_2) \big). \qquad (18) \]

$^{10}$ The dependency on $T$ can be removed via the doubling trick, as described in the Preliminaries. The dependency on $q$ can be 
removed as follows: in each phase $i$ run a fresh instance of TaxonomyBandit for $2^i$ steps, with q = 

Throughout the phase, some tree nodes are designated active. We maintain the following invariant: 

\[ \mathrm{wgt}_t(v) < k_A\, \mathrm{rad}_t(v) \quad \text{for each active internal node } v. \qquad (19) \]

TaxonomyBandit operates as follows. Initially the only active tree node is the root. In each round: 

(S1) If (19) is violated by some $v$, de-activate $v$ and activate all its children. Repeat until (19) holds. 

(S2) Select an active tree node $v$ with the maximal index (17), breaking ties arbitrarily. 

(S3) Play a randomly sampled strategy from $\mathcal{T}(v)$. 

It is easy to see that in each round after step (S1) the invariant (19) holds. 
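
Steps (S1)-(S3), together with the confidence radius (16), the index (17) and the weight estimate (18), can be sketched as below. This is an illustrative sketch rather than the paper's implementation: the random sample is assumed to follow a uniformly random child at each step, the constant $k_A$ is passed in as a plain parameter `k_a`, and the toy tree, Bernoulli payoffs, and all names are hypothetical.

```python
import math
import random
from collections import defaultdict


class TaxonomyBandit:
    def __init__(self, children, root, horizon, k_a, seed=0):
        self.children, self.horizon, self.k_a = children, horizon, k_a
        self.rng = random.Random(seed)
        self.n = defaultdict(int)        # n_t(v): number of hits of subtree v
        self.total = defaultdict(float)  # sum of rewards over those hits
        self.active = {root}

    def subtree(self, v):
        nodes, stack = [], [v]
        while stack:
            u = stack.pop()
            nodes.append(u)
            stack.extend(self.children.get(u, []))
        return nodes

    def rad(self, v):   # confidence radius, as in (16)
        return math.sqrt(8 * math.log(self.horizon) / (2 + self.n[v]))

    def mean(self, v):  # average reward, 0 if never hit
        return self.total[v] / self.n[v] if self.n[v] else 0.0

    def index(self, v):  # index, as in (17)
        return self.mean(v) + (1 + 2 * self.k_a) * self.rad(v)

    def wgt_estimate(self, v):  # weight estimate, as in (18)
        nodes = self.subtree(v)
        # The max over pairs of (mean - rad) - (mean + rad) splits into two terms.
        best_lower = max(self.mean(u) - self.rad(u) for u in nodes)
        worst_upper = min(self.mean(u) + self.rad(u) for u in nodes)
        return max(0.0, best_lower - worst_upper)

    def restore_invariant(self):  # step (S1)
        changed = True
        while changed:
            changed = False
            for v in list(self.active):
                kids = self.children.get(v, [])
                if kids and self.wgt_estimate(v) >= self.k_a * self.rad(v):
                    self.active.remove(v)
                    self.active.update(kids)
                    changed = True

    def step(self, payoff):
        self.restore_invariant()
        v = max(sorted(self.active), key=self.index)  # step (S2)
        path = [v]                                    # step (S3): walk down to a leaf
        while self.children.get(path[-1]):
            path.append(self.rng.choice(self.children[path[-1]]))
        reward = payoff(path[-1])
        for u in path:  # every subtree on the path is hit in this round
            self.n[u] += 1
            self.total[u] += reward
        return path[-1], reward


CHILDREN = {"root": ["a", "b"], "a": ["x1", "x2"], "b": ["x3", "x4"]}
MU = {"x1": 0.9, "x2": 0.7, "x3": 0.2, "x4": 0.4}


def run_demo(rounds=3000, k_a=1.0, seed=0):
    rng = random.Random(seed)
    alg = TaxonomyBandit(CHILDREN, "root", horizon=rounds, k_a=k_a, seed=seed)
    for _ in range(rounds):
        alg.step(lambda x: 1.0 if rng.random() < MU[x] else 0.0)
    return alg
```

Since de-activating a node always activates all of its children, the active nodes at any time induce a partition of the arms, which is what lets step (S2) compare indices across disjoint regions.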

Analysis. We identify a certain high-probability behavior of the system, and argue deterministically under 
the assumption that the system follows this behavior. 

Definition 6.2. An execution of TaxonomyBandit is called clean if for each time $t \le T$ and each tree 
node $v \in V$ the following two properties hold: 

(P1) $|\mu_t(v) - \mu(v)| \le \mathrm{rad}_t(v)$ as long as $n_t(v) > 0$. 

(P2) for each tree node $u \in \mathcal{T}(v)$ we have 

\[ n_t(v)\, \mathcal{P}(v, u) \ge 8 \log T \;\Longrightarrow\; n_t(u) \ge \tfrac{1}{2}\, n_t(v)\, \mathcal{P}(v, u). \]
Claim 6.3. An execution of TaxonomyBandit is clean with probability at least $1 - T^{-2}$. 

From now on we will argue about a clean execution. Note that in a clean execution, $\mathrm{wgt}(v) \ge \mathrm{wgt}_t(v)$. 
Let $\mu^* = \max_{x \in X} \mu(x)$ be the maximal reward. A strategy $x \in X$ is optimal if $\mu(x) = \mu^*$. For each 
tree node $v$, define the badness of $v$ as $\Delta(v) = \mu^* - \mu(v)$. Note that 

\[ \mu(v) \le \mu(u) + \mathrm{wgt}(v) \quad \text{for any tree node } u \in \mathcal{T}(v). \qquad (20) \]

The crux of the proof is that at all times the maximal index is at least $\mu^*$. 

Lemma 6.4. Consider a clean execution of algorithm $\mathcal{A}(T)$. Then the following property holds: 

• in any round $t \le T$, at any point in the execution such that the invariant (19) holds, there exists an 
active tree node $v^*$ such that $I_t(v^*) \ge \mu^*$. 

Proof. Fix any optimal strategy $x^*$, and any active tree node $v^*$ such that $x^* \in \mathcal{T}(v^*)$. Let us denote $\Delta = \mathrm{wgt}(v^*)$. 
By definition of the parameter $q$, there exist internal nodes $v_0, v_1 \in \mathcal{T}(v^*)$ such that $\mathcal{P}(v^*, v_j) \ge q$, $j \in 
\{0, 1\}$, and moreover $|\mu(v_1) - \mu(v_0)| \ge \Delta/2$. 

Let us define $f(\Delta) = 8^3 \log(T)\, \Delta^{-2}$. Then for each tree node $v$ and each $\Delta > 0$ we have 

\[ \mathrm{rad}_t(v) \le \Delta/8 \iff n_t(v) \ge f(\Delta). \]

Now, for the sake of contradiction let us suppose that $n_t(v^*) \ge (\tfrac{1}{4} k_A)^2 f(\Delta)$. This is equivalent to 
$\Delta \ge 2 k_A\, \mathrm{rad}_t(v^*)$. Note that $n_t(v^*) \ge (2/q)\, f(\Delta)$ by our assumption on $k_A$, so by property (P2) in the 
definition of the clean execution, for each node $v_j$, $j \in \{0, 1\}$, we have $n_t(v_j) \ge f(\Delta)$, which implies 
$\mathrm{rad}_t(v_j) \le \Delta/8$. Therefore (18) gives a good estimate of $\mathrm{wgt}(v^*)$, namely $\mathrm{wgt}_t(v^*) \ge \Delta/4$. It follows 
that $\mathrm{wgt}_t(v^*) \ge k_A\, \mathrm{rad}_t(v^*)$, which violates the invariant (19). 

We proved that $\mathrm{wgt}(v^*) \le 2 k_A\, \mathrm{rad}_t(v^*)$. Finally, using (20) we have 

\[ \Delta(v^*) \le \mathrm{wgt}(v^*) \le 2 k_A\, \mathrm{rad}_t(v^*), \]

\[ I_t(v^*) \ge \mu(v^*) + 2 k_A\, \mathrm{rad}_t(v^*) \ge \mu^*, \qquad (21) \]

where the first inequality in (21) holds by definition of the index (17) and by property (P1) in the definition 
of the clean execution. □ 

We use Lemma 6.4 to show that, essentially, there are not too many strategies with large badness $\Delta(\cdot)$, 
and each strategy is not played too often. For each tree node $v$, let $N(v)$ be the number of times node $v$ was 
selected in step (S2) of the algorithm. Call $v$ positive if $N(v) > 0$. We partition all positive tree nodes and 
all deactivated tree nodes into sets 

\[ S_i = \{ \text{tree nodes } v : N(v) > 0 \text{ and } 2^{-i} < \Delta(v) \le 2^{-i+1} \}, \]

\[ S^*_i = \{ \text{deactivated tree nodes } v : 2^{-i} < 4\, \mathrm{wgt}(v) \le 2^{-i+1} \}. \]

Lemma 6.5. Consider a clean execution of algorithm $\mathcal{A}(T)$. 

(a) for each tree node $v$ we have $N(v) \le O(k_A^2 \log T)\, \Delta^{-2}(v)$. 

(b) if node $v$ is de-activated at some point in the execution, then $\Delta(v) \le 4\, \mathrm{wgt}(v)$. 

(c) $|S^*_i| \le K_i \triangleq c\, 2^{(i+1)\, \mathrm{dim}(c)}$ for each $i$. 

(d) $|S_i| \le O(d_{\mathcal{T}}\, K_{i+1})$ for each $i$. 

Proof. For part (a), fix an arbitrary tree node $v$ and let $t$ be the last time $v$ was selected in step (S2) of the 
algorithm. By Lemma 6.4, at that point in the execution there was a tree node $v^*$ such that $I_t(v^*) \ge \mu^*$. Then 
using the selection rule (step (S2)) in the algorithm and the definition of index (17), we have 

\[ \mu^* \le I_t(v^*) \le I_t(v) \le \mu(v) + (2 + 2 k_A)\, \mathrm{rad}_t(v), \]
\[ \Delta(v) \le (2 + 2 k_A)\, \mathrm{rad}_t(v), \qquad (22) \]
\[ N(v) \le n_t(v) \le O(k_A^2 \log T)\, \Delta^{-2}(v). \]

For part (b), suppose tree node $v$ was de-activated at time $s$. Let $t$ be the last round in which $v$ was 
selected. Then 

\[ \mathrm{wgt}(v) \ge \mathrm{wgt}_s(v) \ge k_A\, \mathrm{rad}_s(v) \ge \tfrac{1}{4} (2 + 2 k_A)\, \mathrm{rad}_t(v) \ge \tfrac{1}{4} \Delta(v). \qquad (23) \]

Indeed, the first inequality in (23) holds since we are in a clean execution, the second inequality in (23) holds 
because $v$ was de-activated, the third inequality holds because $n_s(v) = n_t(v) + 1$, and the last inequality 
in (23) holds by (22). 

For part (c), let us fix $i$ and define $X_i = \{x \in X : \Delta(x) \le 2^{-i+1}\}$. By definition of the $c$-zooming 
dimension, this set can be covered by $K = c\, 2^{(i+1)\, \mathrm{dim}(c)}$ subtrees $\mathcal{T}(v_1), \ldots, \mathcal{T}(v_K)$, each of weight 
$< 2^{-i}/4$. Fix a deactivated tree node $v \in S^*_i$. For each strategy $x \in X$ in subtree $\mathcal{T}(v)$ we have, by part (b), 

\[ \Delta(x) \le \Delta(v) + \mathrm{wgt}(v) \le 4\, \mathrm{wgt}(v) \le 2^{-i+1}, \]

so $x \in X_i$ and therefore is contained in some $\mathcal{T}(v_j)$. Note that $v_j \in \mathcal{T}(v)$ since $\mathrm{wgt}(v) > \mathrm{wgt}(v_j)$. It 
follows that the subtrees $\mathcal{T}(v_1), \ldots, \mathcal{T}(v_K)$ cover the leaf set of $\mathcal{T}(v)$. 

Consider the graph $G$ on the node set $S^*_i \cup \{v_1, \ldots, v_K\}$, where two nodes $u, v$ are connected by a 
directed edge $(u, v)$ if there is a path from $u$ to $v$ in the tree $\mathcal{T}$. This is a directed forest of out-degree at least 
2, whose leaf set is a subset of $\{v_1, \ldots, v_K\}$. Since in any directed tree of out-degree $\ge 2$ the number of 
nodes is at most twice the number of leaves, $G$ contains at most $K$ internal nodes. Therefore, $|S^*_i| \le K$, 
proving part (c). 

For part (d), let us fix $i$ and consider a positive tree node $u \in S_i$. Since $N(u) > 0$, either $u$ is active 
at time $T$, or it was deactivated in some round before $T$. In the former case, let $v$ be the parent of $u$. In the 
latter case, let $v = u$. Then by part (b) we have $2^{-i} < \Delta(u) \le \Delta(v) + \mathrm{wgt}(v) \le 4\, \mathrm{wgt}(v)$, so $v \in S^*_j$ for 
some $j \le i + 1$. 

For each tree node $v$, define its family as the set which consists of $v$ itself and all its children. We have 
proved that each positive node $u \in S_i$ belongs to the family of some deactivated node $v \in \bigcup_{j=1}^{i+1} S^*_j$. Since 
each family consists of at most $1 + d_{\mathcal{T}}$ nodes, it follows that 

\[ |S_i| \le (1 + d_{\mathcal{T}}) \Big( \sum_{j=1}^{i+1} K_j \Big) \le O(d_{\mathcal{T}}\, K_{i+1}). \quad □ \]

Now we can bound the regret of TaxonomyBandit. 

Theorem 6.6. Consider the taxonomy MAB problem on a taxonomy of degree $d_{\mathcal{T}}$. Then algorithm TaxonomyBandit, 
parameterized by time horizon $T$ and the quality parameter $q$, achieves regret (for any $c > 0$) 

\[ R_A(T) \le O\big( \tfrac{c\, d_{\mathcal{T}}}{q} \big)^{1/(2+d)}\; T^{1 - 1/(2+d)}\, \log T, \qquad (24) \]
where $d$ is the zooming dimension with constant $c$. 

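Proof. The theorem follows from parts (a) and (d) of Lemma 6.5. Let us fix $c > 0$ and let $d$ be the $c$-zooming 
dimension of the problem instance. Let us assume a clean execution. (Recall that by Claim 6.3 the failure 
probability is negligibly small.) Then: 

\[ \sum_{v \in S_i} N(v)\, \Delta(v) \le O(k_A^2 \log T) \sum_{v \in S_i} \tfrac{1}{\Delta(v)} \le O(k_A^2 \log T)\, |S_i|\, 2^i \le K\, 2^{(i+2)(1+d)}, \]

where $K = O(c\, d_{\mathcal{T}}\, k_A^2 \log T)$. For any $\delta = 2^{-i_0}$ we have 

\[ R_A(T) \le \sum_{\text{tree nodes } v} N(v)\, \Delta(v) = \Big( \sum_{v:\, \Delta(v) < \delta} N(v)\, \Delta(v) \Big) + \Big( \sum_{v:\, \Delta(v) \ge \delta} N(v)\, \Delta(v) \Big) \]

\[ \le \delta T + \sum_{i \le i_0} \sum_{v \in S_i} N(v)\, \Delta(v) \le \delta T + \sum_{i \le i_0} K\, 2^{(i+2)(1+d)} \le \delta T + O(K) \big( \tfrac{1}{\delta} \big)^{(1+d)}. \]

We obtain (24) by setting $\delta = (K/T)^{1/(d+2)}$. □ 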

Acknowledgements. The author is grateful to Ittai Abraham, Bobby Kleinberg and Eli Upfal for many 
conversations about multi-armed bandits. 